ABSTRACT

Diabetes is among the most dangerous chronic diseases, and a multitude of complications
arise if it is left unattended and untreated. Conventionally, identifying the condition
requires a patient to visit a doctor at a medical center for consultation. The rise of
machine learning methodologies offers a way to address this problem. The aim of this study
is to examine models that can predict the likelihood of diabetes in patients with maximal
accuracy. To that end, four algorithms, namely Decision Tree, Random Forest, Naive Bayes
and Adaboost Classifier, are used in this research to predict diabetes at an early stage.

Experiments are conducted on two datasets: the Pima Indians Diabetes Database (PIDD),
obtained from the UCI machine learning repository, and an auxiliary dataset. The
performance of all four algorithms is assessed on the basis of Accuracy, Precision, Recall
and F-Measure. The results reveal that the Adaboost Classifier outperforms the other
algorithms, with the highest accuracy of 85.5% for PIDD and 95.4% for the auxiliary
dataset. The results are also examined using Receiver Operating Characteristic (ROC)
curves.
1. INTRODUCTION:

Diabetes, scientifically known as Diabetes Mellitus, is a chronic disease
which threatens human health. It is a condition that impedes the body's ability to
process blood sugar. Malfunction of the insulin hormone causes an increase in blood sugar
levels. Untreated high blood glucose from diabetes has dire effects and
causes dysfunction of the nerves, eyes, kidneys and other
organs (Krasteva et al., 2011). With the development and progression of living
standards, diabetes has become increasingly common among people. Obesity,
absence of physical activity, smoking and an unhealthy diet contribute to diabetic
conditions, leading to complications in many parts of the body and amplifying the
risk of premature death. According to The Hindu, the number of adults
suffering from diabetes in India is estimated at 77 million. The
prevalence in urban areas is in the range of 10.9%-14.2%, and in rural areas in the
range of 3.0%-7.8%.


Diabetes is divided into two categories, type 1 diabetes (T1D) and type 2
diabetes (T2D). Type 1 diabetes, otherwise called juvenile diabetes, is a chronic
condition which occurs when the pancreas fails to produce insulin, causing the patients
to be insulin-dependent (Iancu et al., 2008). The symptoms include increased thirst
and hunger, frequent urination and blurry vision. Type 2 diabetes arises from
lifestyle factors and genetics, where the body doesn't utilize insulin correctly,
leading to abnormal blood sugar levels. It occurs more prominently in
individuals aged 45 or above (Robertson et al., 2011).


Means to rapidly detect and analyze the condition of diabetes and the
risks involved with it are therefore a topic worthy of study. The alarming rise in
cases of diabetes globally has made it important to come up with solutions for
early-stage detection of the disease. Machine learning has been very helpful in the
prediction of many diseases, and its analysis tools have made it a boon in the
medical field. With the aid of various machine learning algorithms, its usage can
be extended to the prediction of Diabetes Mellitus and thus help in curbing
its spread.


Numerous researchers have applied various algorithms to the prediction of
diabetes mellitus, among them SVM, Decision tree, Naive Bayes, Decision
forest, PCA and J48. Mujumdar & Vaidehi (2019, p. 293) compared
machine learning algorithms for predicting the presence of diabetes. Zou et al.
(2018, p. 515) distinguished diabetic patients from healthy people by using principal
component analysis (PCA) and minimum redundancy maximum relevance (mRMR)
for dimensionality reduction. Joshi & Chawan (2018, p. 12) used three different
supervised machine learning methods, namely SVM, Logistic regression and ANN.


Machine learning algorithms like the decision tree have become very popular in the
prediction of Diabetes Mellitus, owing to their apt classification. This research deals
with gestational diabetes and makes use of data on females above the age of 21
years.


The algorithms used in this research include Decision tree, Random forest, Naive
Bayes and AdaBoost. The experimental efficiency of these
algorithms is compared and a conclusion is drawn.


The first dataset used for this study contains medical details of 768 instances,
all of which are female patients. The dataset contains 8 attributes and a class
label, where the class value '0' is treated as tested negative for diabetes and the
class value '1' is treated as tested positive for diabetes.


The second dataset used for this study contains 15000 observations of females
above the age of 21. It has 9 attributes, i.e. PatientID, Pregnancies,
PlasmaGlucose, DiastolicBloodPressure, TricepsThickness, SerumInsulin,
BMI, DiabetesPedigree and Age.


2. RELATED WORK: 

To address the growing number of diabetes cases, a large body of research has
been conducted on its detection. Most prior work in this area applies a
variety of machine learning techniques and algorithms to obtain relevant
judgments and results for evaluating the presence of diabetes.
Several approaches provided support for arriving at optimized conclusions.


For example, Orabi et al. (2009) designed a model for the prediction of Diabetes
Mellitus in which the research aimed at establishing a relation between age and
diabetes. It made use of a Decision Tree for the prediction of diabetes, which gave
satisfactory results. A regression technique was combined with randomization,
which helped in the prediction of age along with the prediction of diabetes.


Q. Zou et al. designed a prediction model for Diabetes Mellitus using a dataset
from hospital physical examinations in Luzhou, China. Decision tree, random
forest and neural network were used for the prediction.


D. Sisodia et al. (2018) researched a model for the early prediction of
diabetes using SVM, Naive Bayes and Decision Tree on the Pima Indians dataset. The
research suggested that Naive Bayes had the highest prediction accuracy among the
aforementioned algorithms. Joshi & Chawan (2018, p. 12) focused their study
essentially on three supervised machine learning methods: SVM, Logistic
regression and ANN.


N. Yuvraj et al. (2017) worked on the prediction of diabetes using Hadoop
clustering, so as to deal with enormous amounts of unstructured data and to establish
its applicability in modern healthcare systems. The Pima Indians Diabetes Dataset was
used for the prediction of diabetes in this model.


Multiple algorithms were applied to the datasets, and the efficiency of each algorithm
was calculated on the basis of accuracy, precision, recall, F-Measure and the
Receiver Operating Characteristic (ROC) curve. In accordance with research done
globally, algorithms like Naive Bayes, Random Forest, Decision Tree and AdaBoost were
applied to draw the inferences that may be most helpful for accurate prediction.

The work presented in this paper builds on previous research to provide
adequate prediction, and considers how the availability of information affects the
study and its results. Further, it aims to expand the scope of diabetes prediction
and deliver results with exactitude.


3. METHODOLOGY USED: 

3.1 Model Diagram: 

The flow shown in the figure summarizes the analysis performed in
evaluating the model.





[Figure: Diabetes Patient Datasets -> Preprocess Data -> Applied Classification
Algorithms (Decision Tree, Random Forest, Naive Bayes, Adaboost)]

FIG. 1. The proposed procedure encapsulated in the form of a model diagram.

3.2 Brief Explanation of Algorithms Used:

3.2.1 Decision Tree: 

Decision tree is a supervised learning algorithm, largely used for
classification problems. It has a tree-like structure made up of decision nodes and
leaf nodes. Its purpose is to separate the population into two or
more homogeneous sets based on the most significant predictors by calculating the
entropy of every attribute. The dataset is then split on the
attribute with the highest information gain, i.e. the lowest entropy after the split.
These two steps are repeated recursively with the remaining attributes.


It uses two kinds of nodes for classification, internal and external (leaf) nodes,
which are linked to one another. The internal nodes represent decisions and form the
decision-making part, while the leaf nodes are associated with the class labels.
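
The entropy and information-gain calculation that drives each split can be sketched
as follows. This is a minimal illustration in plain Python, not the implementation
used in this study; the function names entropy and information_gain and the toy
class labels are assumptions introduced here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its child subsets."""
    total = len(parent)
    weighted = sum(len(ch) / total * entropy(ch) for ch in children if ch)
    return entropy(parent) - weighted


# Toy example (hypothetical labels): 1 = tested positive, 0 = tested negative.
parent = [1, 1, 0, 0]
split = [[1, 1], [0, 0]]  # a candidate split that separates the classes perfectly
gain = information_gain(parent, split)
```

In this toy case the split separates the two classes completely, so the information
gain equals the parent entropy of 1 bit. A decision tree would evaluate every
candidate attribute in this way and split on the one with the largest gain.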


The assessed performance of Decision tree classifier is depicted by the confusion 
matrix below: 


Table 1. Confusion Matrix of Decision Tree 


             TP      FP     TN      FN
DATASET 1    163     27     54      29
DATASET 2    3253    290    1379    403


3.2.2 Random Forest: 

Random Forest is an ensemble classification algorithm based on bagging. It
works by selecting training subsets randomly, with replacement, from the specified
dataset and training a decision tree on each subset. Unlike boosting methods such as
AdaBoost, the trees are trained independently of one another rather than sequentially.


It develops a grove of decision trees and predicts the outcome by taking
the majority vote of the trees. Using a multitude of
trees helps to avoid overfitting.
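
The two ingredients of this procedure, random sampling of the training set and
majority voting over the trees, can be sketched as below. This is an illustrative
outline in plain Python under the assumption of sampling with replacement; the
helper names bootstrap_sample and majority_vote and the example votes are
hypothetical and are not taken from the study.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random with replacement (one training subset per tree)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the class predicted by most trees (ties go to the class seen first)."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)                 # fixed seed so the sketch is repeatable
rows = list(range(10))                 # stand-in for the indices of ten patients
subset = bootstrap_sample(rows, rng)   # some rows may repeat, others may be left out

tree_votes = [1, 0, 1, 1, 0]           # hypothetical predictions of five trees
forest_prediction = majority_vote(tree_votes)
```

Each tree would be fitted on its own subset, and the forest reports the class that
receives the most votes, here class 1 with three of the five votes.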


The assessed performance of Random Forest is depicted by the confusion matrix 
below: 


Table 2. Confusion Matrix of Random Forest 


             TP      FP     TN      FN
DATASET 1    78      12     30      15
DATASET 2    1649    52     754     95


3.2.3 Naive Bayes Classifier: 
Naive Bayes is a supervised learning algorithm based on the assumption that all
features are independent of and unassociated with each other.

It works well on problems that can be framed in terms of conditional probabilities.
Naive Bayes is a classification technique which implements Bayes' Theorem.

Applying Bayes' Theorem, the posterior probability P(A|B) can be computed from
P(A), P(B) and P(B|A).

Thence, P(A|B) = P(B|A) * P(A) / P(B)


Where, 

P(A|B)=posterior probability 
P(B|A)=likelihood 
P(A)=prior probability 
P(B)=evidence 
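
As a worked illustration of the formula, the short sketch below computes a posterior
from a prior, a likelihood and the evidence. The function name posterior and all
three probabilities are hypothetical values chosen for the example; they are not
estimates from either dataset.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    if evidence <= 0:
        raise ValueError("P(B) must be positive")
    return likelihood * prior / evidence


# Hypothetical numbers: A = "patient is diabetic", B = "glucose reading is high".
p_a = 0.35          # prior P(A)
p_b_given_a = 0.60  # likelihood P(B|A)
p_b = 0.30          # evidence P(B)

p_a_given_b = posterior(p_b_given_a, p_a, p_b)  # approximately 0.7
```

With these assumed values, observing a high glucose reading raises the probability
of diabetes from the prior of 0.35 to a posterior of about 0.7.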


The assessed performance of Naive Bayes classifier is depicted by the confusion 
matrix below: 


Table 3. Confusion Matrix of Naive Bayes 


             TP      FP     TN      FN
DATASET 1    75      14     28      14
DATASET 2    1562    139    515     334


3.2.4 Adaboost Classifier: 

Adaboost is an iterative ensemble technique. It combines several classifiers
to improve their accuracy. The underlying idea of the Adaboost
classifier is to set the weights of the classifiers and to re-weight the training
samples in each iteration so as to ensure accurate predictions of unusual
observations.


It does this by assigning higher weights to wrongly classified
observations, so that they have a greater probability of being classified correctly in
the next iteration. This process repeats until the training data is fitted without
error or a maximum number of classifiers is reached. The assessed
performance of the Adaboost classifier is depicted by the confusion matrix below:


Table 4. Confusion Matrix of Adaboost Classifier 


             TP      FP     TN      FN
DATASET 1    84      7      34      13
DATASET 2    1734    50     833     74
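
The re-weighting of observations described in Section 3.2.4 can be sketched as a
single boosting round. The code below assumes the classic binary AdaBoost update,
in which the classifier weight is alpha = 0.5 * ln((1 - err) / err); the paper does
not state which variant was used, and the function name update_weights and the
four-sample example are hypothetical.

```python
import math


def update_weights(weights, misclassified):
    """One round of the classic binary AdaBoost re-weighting.

    weights       -- current sample weights, assumed to sum to 1
    misclassified -- booleans, True where the weak classifier was wrong
    Returns the classifier weight alpha and the normalized new sample weights.
    """
    err = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    if not 0.0 < err < 1.0:
        raise ValueError("weighted error must lie strictly between 0 and 1")
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Raise the weight of wrongly classified samples, lower the rest.
    raw = [w * math.exp(alpha if wrong else -alpha)
           for w, wrong in zip(weights, misclassified)]
    total = sum(raw)
    return alpha, [w / total for w in raw]


weights = [0.25, 0.25, 0.25, 0.25]            # four samples, uniform start
misclassified = [False, False, False, True]   # the weak classifier misses the last one
alpha, new_weights = update_weights(weights, misclassified)
```

After the update the single misclassified sample carries half of the total weight,
so the next weak classifier is pushed to get that observation right.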


3.3 Dataset Used: 
This study was conducted on two datasets comprising diabetic medical details.


PIDD-Pima Indians Diabetes Dataset 

The aforementioned methodology is assessed on the Pima Indians Diabetes Dataset
(PIDD), which consists of medical information on 768
instances, all of which are female patients. The dataset incorporates 8 attributes
and a class label, in which the class value '0' denotes tested negative for diabetes
and the class value '1' denotes tested positive for diabetes.


Diabetes from DAT263xLab01 

The same methodology is evaluated on the dataset Diabetes
from DAT263xLab01, which contains 15000 observations of females above the age
of 21. It comprises 9 attributes and a class label, in which the class value '0'
denotes tested negative for diabetes and the class value '1' denotes tested
positive for the disease.


Table 5. Attribute description

Attributes of Pima Indians Diabetes Dataset (PIDD)    Attributes of dataset 2
                                                      Patient Id
Pregnancies                                           Pregnancies
Glucose                                               Glucose
Blood Pressure                                        Blood Pressure
Skin Thickness                                        Skin Thickness
Insulin                                               Insulin
BMI                                                   BMI
Diabetes Pedigree Function                            Diabetes Pedigree Function
Age                                                   Age
Class 0 or 1                                          Class 0 or 1





3.4 Accuracy Measures: 

The Decision Tree, Adaboost, Random Forest and Naive Bayes algorithms have been
used for the prediction of diabetes. The efficiency of the algorithms was
judged on several measures, i.e. accuracy, precision, recall and F1.
Table 6. Accuracy Measures Description

Measure      Significance                                           Formula
Accuracy     Determines the accuracy in prediction of an instance   A = (TP+TN) / (Total no. of samples)
Precision    Classifier's correctness/accuracy                      P = TP / (TP+FP)
Recall       Completeness and sensitivity of the classifier         R = TP / (TP+FN)
F1           Weighted avg of precision and recall                   F = 2*(P*R) / (P+R)
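
The formulas of Table 6 translate directly into code. The sketch below is an
illustration only: the function names are assumptions, the accuracy example uses the
Dataset 1 counts reported for Adaboost in Table 4, and the F-measure example uses
the Adaboost precision and recall reported for the first dataset in Table 7.

```python
def accuracy(tp, fp, tn, fn):
    """A = (TP + TN) / total number of samples."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    """P = TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp, fn):
    """R = TP / (TP + FN)."""
    return tp / (tp + fn)


def f_measure(p, r):
    """F = 2 * (P * R) / (P + R)."""
    return 2 * p * r / (p + r)


# Adaboost on Dataset 1, counts as listed in Table 4.
acc = accuracy(tp=84, fp=7, tn=34, fn=13)

# Adaboost on the first dataset, precision and recall as listed in Table 7.
f1 = f_measure(0.829, 0.723)
```

Rounded to three decimals, acc is 0.855 and f1 is 0.772, which agrees with the
accuracy of 85.5% and the F-Measure of 0.772 reported for Adaboost on the first
dataset.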

















3.4.1 Comparative performance of different classification algorithms in
accordance with accuracy measures.


Table 7. Accuracy Measures for the Datasets

Dataset 1 (PIDD)

Algorithm        Precision    Recall    F-Measure    Accuracy %
Random Forest    0.738        0.688     0.712        81.4
Adaboost         0.829        0.723     0.772        85.5
Decision Tree    0.666        0.65      0.658        79.4
Naive Bayes      0.666        0.666     0.666        78.6

Dataset 2

Algorithm        Precision    Recall    F-Measure    Accuracy %
Random Forest    0.937        0.893     0.915        94.4
Adaboost         0.943        0.921     0.932        95.4
Decision Tree    0.826        0.773     0.799        86.9
Naive Bayes      0.706        0.393     0.505        74.3




















The performance of the different classification algorithms on the
various criteria is listed in the above table.


TP signifies true positive, TN signifies true negative, FN signifies false negative 
and FP false positive. 


The performance of the aforementioned algorithms is compared on
the basis of precision, recall, F-measure and accuracy.


4. RESULT: 

It was inferred that Adaboost proved to be more efficient than the other
algorithms and achieved the highest accuracy measures on both the PIDD dataset and
the second dataset. Random Forest gave the second-best results on both datasets.


The outcome of the experiments is summarized in the graph below, where
accuracy is compared for Decision Tree, Random Forest,
Adaboost and Naive Bayes.





[Figure: bar chart of accuracy (percentage) for Decision Tree, Random Forest,
Adaboost and Naive Bayes on Dataset 1 and Dataset 2]

Fig. 2: Accuracy
Table 5 describes the attributes of the datasets used for carrying out this
research.
5. CONCLUSION:


Detection of Diabetes Mellitus at an early stage can help in taking precautionary
measures. In this research, various classification algorithms were compared with each
other and a systematic conclusion was drawn on the basis of their performance
on measures like precision, recall, F-Measure and accuracy.
For both datasets the Adaboost classifier performed best,
achieving an accuracy of 85.5% and
95.4% on the first and second datasets respectively.